The goal of my project was to predict when an NBA player will have his peak season.
I based this entirely on PER, or Player Efficiency Rating. My
initial plan was to come up with a model or a new way to calculate
something very similar, but I didn’t realize how
complicated and accurate PER already was. Here is a picture of the formula
analysts use to find PER. Pretty dang
complicated, right?
The data was already organized fairly well, but after tidying I still had to select which data I wanted.
I filtered the data for several reasons.
First, I only wanted seasons 1997-2023: before 1997 the data wasn’t complete, and I left out the 2024 season because it is still in progress and does not have complete data either.
I kept only seasons with more than 1,500 minutes played to filter out bench players, who would skew the PER predictions later. That corresponds to players averaging about 18-24 minutes a game, which also leaves a buffer for starters who missed time with injuries.
Another filter was that a player needed to play more than 30 games, for reasons similar to the minutes filter.
I also had to filter on position, because a number of players were listed at multiple positions.
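The filters above can be sketched in a few lines. This is a minimal Python illustration, not the code used in the project: the rows are made up, the field names (`season`, `mp`, `g`, `pos`) are assumptions, and dropping multi-position rows is only one possible way to handle the position filter.

```python
# Hypothetical season rows; every value here is invented for illustration.
rows = [
    {"player": "A", "season": 1996, "mp": 2400, "g": 80, "pos": "PG"},
    {"player": "B", "season": 2005, "mp": 1200, "g": 70, "pos": "C"},
    {"player": "C", "season": 2010, "mp": 1800, "g": 25, "pos": "SF"},
    {"player": "D", "season": 2015, "mp": 2100, "g": 75, "pos": "SG-SF"},
    {"player": "E", "season": 2020, "mp": 1900, "g": 60, "pos": "PF"},
    {"player": "F", "season": 2024, "mp": 2000, "g": 65, "pos": "C"},
    {"player": "G", "season": 2001, "mp": 2500, "g": 82, "pos": "C"},
]

# Assumption: a row is kept only if it lists exactly one of the five positions.
SINGLE_POSITIONS = {"PG", "SG", "SF", "PF", "C"}


def keep(row):
    """Apply the season, minutes, games, and position filters to one row."""
    return (
        1997 <= row["season"] <= 2023      # complete seasons only
        and row["mp"] > 1500               # more than 1,500 minutes played
        and row["g"] > 30                  # more than 30 games played
        and row["pos"] in SINGLE_POSITIONS
    )


filtered = [row for row in rows if keep(row)]
print([row["player"] for row in filtered])
```

Only players E and G pass all four filters in this toy data; each of the others fails exactly one of them.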
Almost all the variables I tested ended up being strongly
correlated with PER; however, here are some results I found
pretty interesting. This one compares PER with the percentage of attempts
a player takes from the three-point line. Surprisingly, the more
three-point shots centers take, the greater their PER, whereas PER for the
other positions decreases dramatically.
This graph is cool because we see that over time, Centers and Point
Guards have gotten “better,” meaning their average PER has risen. Power
Forwards, Small Forwards, and Shooting Guards have all gotten “worse,” meaning
their average PER has dropped.
I made several models and, as you can see from the comparison graph below, model 4 seems to fit my data best for predicting PER.
The PER formula for m4 is experience + g + mp + ts_percent + x3p_ar +
f_tr + orb_percent + drb_percent + trb_percent + ast_percent +
stl_percent + blk_percent + tov_percent + usg_percent + ows + dws + ws +
ws_48 + obpm + dbpm + bpm + vorp
Here is a graph of PER for players who have 8 years of
experience. It shows each player’s PER for every season they have played
up to their 8th, with a trend line for each player.
This is just a sample of what the data looks like.
I experimented with the caret library, which stands for
Classification And REgression Training, to build my knowledge of
model training. I set a random seed, split my data into training and
testing sets, set up a linear model with the formula from model 4, and
predicted the peak season for each NBA player.
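To make that workflow concrete, here is a rough Python sketch of a seeded split, a linear fit, and a peak-season prediction. It is not the caret code from the project. The data is made up, a single predictor (`ws_48`) stands in for the full model 4 formula, and taking each player's season with the highest predicted PER as the peak is my assumption about how a peak could be derived.

```python
import random

# Hypothetical rows: (player, experience, ws_48, per). All values are made up.
rows = [
    ("A", 1, 0.08, 13.1), ("A", 2, 0.11, 15.4), ("A", 3, 0.15, 18.9),
    ("A", 4, 0.13, 17.2), ("B", 1, 0.05, 11.0), ("B", 2, 0.09, 14.2),
    ("B", 3, 0.10, 14.8), ("B", 4, 0.07, 12.5), ("C", 1, 0.12, 16.0),
    ("C", 2, 0.16, 19.5),
]


def split(data, test_fraction=0.2, seed=42):
    """Shuffle a copy with a fixed seed so the split is reproducible."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


train, test = split(rows)
intercept, slope = fit_line([r[2] for r in train], [r[3] for r in train])

# Mean absolute error on the held-out rows.
test_mae = sum(abs(intercept + slope * r[2] - r[3]) for r in test) / len(test)

# Predicted peak: the experience year with the highest predicted PER per player.
predicted_peak = {}
for player, experience, ws_48, _ in rows:
    pred = intercept + slope * ws_48
    if player not in predicted_peak or pred > predicted_peak[player][1]:
        predicted_peak[player] = (experience, pred)

for player, (experience, pred) in sorted(predicted_peak.items()):
    print(player, "predicted peak in year", experience)
print("test MAE:", round(test_mae, 2))
```

Fixing the seed means the same rows land in the training and testing sets on every run, so results can be compared across model changes.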
This first graph is a really good representation of how accurate my
predictions are compared to the last.
This graph is a nice, easy-to-read indicator of how far off
my predictions were, and how often.
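A tally like the following is one way to produce the numbers behind such a graph. It is a small Python sketch with invented (actual, predicted) peak years, not the project's results.

```python
from collections import Counter

# Hypothetical (actual_peak, predicted_peak) pairs, in years of experience.
pairs = [(5, 5), (3, 4), (7, 5), (4, 4), (6, 7), (2, 2), (9, 6)]

# Signed error: positive means the predicted peak came later than the real one.
errors = Counter(predicted - actual for actual, predicted in pairs)

for error in sorted(errors):
    print(f"{error:+d} years: {errors[error]} player(s)")
```

Each key is how far off a prediction was and each count is how often that miss occurred.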
I also learned about the plotly package, which is primarily used for interactive data visualization.
| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| player_id | 6 | 2986.166667 | 16.773988 | 2988 | 2986.166667 | 14.5 | 2967 | 3004 | 37 | -0.0952759 | 1.198705 | 6.847952 |
| max_PER_experience | 6 | 5.333333 | 2.875181 | 5 | 5.333333 | 0.5 | 1 | 10 | 9 | 0.1884518 | 2.904787 | 1.173788 |
| first_experience | 6 | 1.000000 | 0.000000 | 1 | 1.000000 | 0.0 | 1 | 1 | 0 | NaN | NaN | 0.000000 |
| experience_difference | 6 | 4.333333 | 2.875181 | 4 | 4.333333 | 0.5 | 0 | 9 | 9 | 0.1884518 | 2.904787 | 1.173788 |
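
The columns summarized in the table could be derived per player along these lines. This is a Python sketch with made-up rows. It assumes that `max_PER_experience` is the experience year with the highest PER and that `experience_difference` is `max_PER_experience` minus `first_experience`, which is consistent with the means in the table.

```python
# Hypothetical (player_id, experience, per) rows; values are invented.
seasons = [
    (1, 1, 12.0), (1, 2, 15.5), (1, 3, 14.1),
    (2, 1, 18.2), (2, 2, 17.0),
    (3, 1, 10.3), (3, 2, 11.9), (3, 3, 13.4), (3, 4, 16.8),
]

summary = {}
for player_id, experience, per in seasons:
    s = summary.setdefault(
        player_id,
        {"first_experience": experience,
         "max_PER_experience": experience,
         "best_per": per},
    )
    # Track the earliest season and the season with the highest PER.
    s["first_experience"] = min(s["first_experience"], experience)
    if per > s["best_per"]:
        s["best_per"] = per
        s["max_PER_experience"] = experience

for s in summary.values():
    s["experience_difference"] = s["max_PER_experience"] - s["first_experience"]

for player_id, s in sorted(summary.items()):
    print(player_id, s["max_PER_experience"], s["experience_difference"])
```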